Berk, Brown and Zhao (2010)
Classic regression analysis treats predictors as fixed (non-random), which allows estimation of unbiased coefficients.
Berk and colleagues (2010) argue that model selection is ubiquitous in criminology.
When the correct model is unknown…
However, what happens when the same data are used both to select the model and to draw inferences?
Sampling distributions become distorted
Estimates are biased (e.g., Beta coefficients, standard errors)
Overconfidence in results (inflated Type I error)
Direct Censoring
Indirect Censoring
Alterations in dispersion of regression parameter estimates
Forward stepwise regression using AIC applied to 10,000 samples of \(n\) = 200.
At each step, the term is added that leads to the model with the smallest AIC. The procedure stops when no remaining regressor improves the AIC.
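As a minimal illustration of the procedure (using R's built-in `step()` on the `mtcars` data rather than the paper's simulation; the variable names here are for illustration only):

```r
# Forward stepwise selection by AIC: start from the intercept-only model
# and add, at each step, the regressor that most improves the AIC.
null_fit   <- lm(mpg ~ 1, data = mtcars)        # intercept-only starting model
full_scope <- ~ wt + hp + disp                  # candidate regressors
sel <- step(null_fit, scope = full_scope,
            direction = "forward", k = 2,       # k = 2 gives the AIC penalty
            trace = FALSE)
names(coef(sel))                                # regressors retained by AIC
```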
\[ \begin{bmatrix} w \\ x \\ z \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 5.0 & 4.0 & 5.0 \\ 4.0 & 6.0 & 5.0 \\ 5.0 & 5.0 & 7.0 \end{bmatrix} \right) \]
The fixed effects coefficients are \(\beta_0\) = 3.0, \(\beta_1\) = 0.0, \(\beta_2\) = 1.0, and \(\beta_3\) = 2.0.
The error term is normally distributed with mean 0 and standard deviation 10: \(\varepsilon \sim \mathcal{N}(0, 10^2)\).
```r
# Monte Carlo replication of Berk et al. (2010): forward stepwise selection
# by AIC applied to 10,000 samples of n = 200.
reps  <- 10000
n     <- 200
p     <- 3
Sigma <- matrix(c(5, 4, 5,
                  4, 6, 5,
                  5, 5, 7), p, p)               # covariance of (w, x, z)
betas <- c(3, 0, 1, 2)                          # intercept, beta_1 (w), beta_2 (x), beta_3 (z)

rsq    <- numeric(reps)
tstats <- matrix(NA, nrow = reps, ncol = p,
                 dimnames = list(NULL, c("w", "x", "z")))

for (i in seq_len(reps)) {
  X  <- MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma)
  y  <- as.numeric(cbind(1, X) %*% betas + rnorm(n, 0, 10))
  Xy <- as.data.frame(cbind(X, y))
  colnames(Xy) <- c("w", "x", "z", "y")

  # Forward stepwise selection by AIC (k = 2), starting from the null model
  null <- lm(y ~ 1, data = Xy)
  sel  <- step(null, scope = ~ w + x + z,
               direction = "forward", k = 2, trace = FALSE)

  s  <- summary(sel)
  tv <- s$coefficients[, 3][-1]                 # t-values of the retained regressors
  tstats[i, names(tv)] <- tv
  rsq[i] <- s$r.squared
}
```

| Selected model (%) | None | W | X | Z | WX | WZ | XZ | WXZ |
|---|---|---|---|---|---|---|---|---|
| Berk et al. | 0 | 0 | 0.0001 | 17.4 | 1.0 | 4.9 | 65.7 | 10.8 |
| Replication | 0 | 0.02 | 0.03 | 17.8 | 1.16 | 4.35 | 65.23 | 11.41 |
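The selection percentages above can be tabulated from the per-replication sets of retained regressors; a minimal sketch, where `picks` is hypothetical illustration data standing in for the actual simulation output:

```r
# Tabulate how often each subset of {w, x, z} is selected, as in the table
# above. `picks` is a hypothetical stand-in for the simulation results.
picks  <- list(c("x", "z"), c("w", "x", "z"), c("x", "z"), character(0))
labels <- vapply(picks, function(v) {
  if (length(v) == 0) "None" else paste(toupper(sort(v)), collapse = "")
}, character(1))
round(100 * table(labels) / length(labels), 2)  # percentage of samples per model
```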
The \(R^2\)s varied over the simulations between about .3 and .4.
For \(X\), the post-model-selection t-value distribution has a larger mean (2.6 vs. 2.2) and a smaller standard deviation (0.79 vs. 1.0).
For \(Z\), both the mean and the standard deviation are biased substantially upward: the mean from 4.9 to 5.5 and the standard deviation from 1.0 to 2.3.
| Predictor | Coverage | Estimate | t-value | Inclusion (%) | Bias | MSE |
|---|---|---|---|---|---|---|
| W | 0.719 | 0.117 | 0.264 | 16.94 | -1.335 | 4.470 |
| X | 0.968 | 1.179 | 2.595 | 77.83 | -0.325 | 1.492 |
| Z | 0.963 | 2.064 | 5.400 | 98.79 | 0.562 | 1.845 |
\[ \text{Bias} = \frac{1}{S}\sum^S_{s=1}{\left( \hat{\beta}_{p,s} - \beta_p\right)}, \quad \text{MSE} = \frac{1}{S}\sum^S_{s=1}{\left( \hat{\beta}_{p,s} - \beta_p\right)^2} \]
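These two formulas reduce to simple means over the simulated estimates. A toy sketch (the upward shift of 0.5 in `bhat` is illustrative data, not the paper's results):

```r
# Bias and MSE of simulated estimates `bhat` of a true coefficient `beta`.
set.seed(1)
beta <- 2
bhat <- beta + rnorm(1000, mean = 0.5, sd = 1)  # estimates biased upward by 0.5
bias <- mean(bhat - beta)                       # (1/S) * sum(bhat - beta)
mse  <- mean((bhat - beta)^2)                   # (1/S) * sum((bhat - beta)^2)
c(bias = bias, mse = mse)
```

Note that the MSE decomposes as squared bias plus variance, so a biased selection procedure can never have an MSE below its squared bias.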
Figure legend (both panels): red solid curve = conditional on the preferred model being known; blue broken curve = conditional on the predictor being included in a model.
For the preferred model, power to reject \(H_0: \beta_2=0\) with \(\alpha =0.05\) is approximately 60%.
After model selection, that probability is about 76%.
Bias due to model selection artificially inflates power by about 27%.
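The power figures can be computed as the share of replications whose t-value exceeds the two-sided 5% critical value. A sketch using a normal approximation with the post-selection moments quoted above for \(X\) (mean 2.6, sd 0.8; these draws are illustrative, not the paper's simulated t-values):

```r
# Approximate power: fraction of t-values beyond the 5% critical value.
set.seed(1)
tvals <- rnorm(10000, mean = 2.6, sd = 0.8)  # approx. post-selection t-values for X
crit  <- qt(0.975, df = 200 - 4)             # two-sided critical value, full model df
pwr   <- mean(abs(tvals) > crit)             # share rejecting H0: beta_2 = 0
pwr
```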
Selecting the preferred model (\(X\) and \(Z\) included) does not guarantee any of the desirable properties of the regression coefficient estimates.
The post-model-selection sampling distribution can take on a wide variety of shapes, even when the researcher happens to have selected the model accurately representing how the data were generated.
Inference from these models can lead to biased regression parameter estimates and misleading statistical tests and confidence intervals.
The particular selection procedure used does not materially matter.
Split sample in training and test samples (😦)
Collect two random samples (😨)
Derive a theoretically based appropriate model (😰)
Differentiate between confirmatory and exploratory analysis (🤯)
Should all else fail, forego formal statistical inference altogether (☠️)
Try my ShinyApp!
Statistical Inference After Model Selection